Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets

Authors

  • Adam Nickerson
  • Nathalie Japkowicz
  • Evangelos E. Milios
Abstract

The class imbalance problem causes a classifier to overfit the data belonging to the class with the greatest number of training examples. The purpose of this paper is to argue that methods that equalize class membership are not as effective as possible when applied blindly and that improvements can be obtained by adjusting for the within-class imbalance. A guided resampling technique is proposed and tested within a simpler letter recognition domain and a more difficult text classification domain. A fast unsupervised clustering technique, Principal Direction Divisive Partitioning (PDDP), is used to determine the internal characteristics of each class. Performance in categories that suffer from a large between-class imbalance (few positive examples) is shown to improve when the guided resampling method is used.
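The abstract does not give the details of the procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: a class is partitioned with a PDDP-style split (projection onto the leading principal direction of the centered data, divided by sign), and each resulting cluster is then oversampled with replacement up to the size of the largest cluster. The function names, the rule of always splitting the cluster with the largest scatter, and the oversampling target are illustrative choices, not taken from the paper.

```python
import numpy as np


def pddp_split(X):
    """Split the rows of X by the sign of their projection onto the
    leading principal direction of the mean-centered data."""
    centered = X - X.mean(axis=0)
    # The first right singular vector is the principal direction.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    return proj <= 0, proj > 0


def pddp_clusters(X, n_clusters):
    """Divisively partition X into at most n_clusters clusters.

    Assumption: the cluster with the largest scatter (sum of squared
    distances to its centroid) is the one split at each step.
    """
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < n_clusters:
        scatter = [
            ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
            for k in range(labels.max() + 1)
        ]
        idx = np.flatnonzero(labels == int(np.argmax(scatter)))
        if len(idx) < 2:
            break
        left, right = pddp_split(X[idx])
        if not left.any() or not right.any():
            break  # degenerate split, nothing more to separate
        labels[idx[right]] = labels.max() + 1
    return labels


def guided_oversample(X, labels, rng):
    """Resample with replacement so that every cluster reaches the size
    of the largest cluster (an assumed target, for illustration)."""
    target = np.bincount(labels).max()
    parts = []
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        extra = rng.choice(idx, size=target - len(idx), replace=True)
        parts.append(np.concatenate([idx, extra]))
    return X[np.concatenate(parts)]
```

Applied to one class at a time, `pddp_clusters` exposes the within-class structure and `guided_oversample` evens it out; the between-class imbalance can then be addressed by bringing the resampled classes to a common size.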


Similar resources

On the effectiveness of preprocessing methods when dealing with different levels of class imbalance

V. García, R.A. Mollineda. doi:10.1016/j.knosys.2011.06.013. The present paper investigates the influence of both the imbalance ratio and the classifier on the performance of several resampling strategies to deal with imbalanced data sets. The study focuses on evaluat...


Evolutionary rule-based systems for imbalanced data sets

This paper investigates the capabilities of evolutionary online rule-based systems, also called Learning Classifier Systems (LCSs), for extracting knowledge from imbalanced data. While some learners may suffer from class imbalances and instances sparsely distributed around the feature space, we show that LCSs are flexible methods that can be adapted to detect such cases and find suitable models...


Dealing with Difficult Minority Labels in Imbalanced Mutilabel Data Sets

Multilabel classification is an emergent data mining task with a broad range of real world applications. Learning from imbalanced multilabel data is being deeply studied latterly, and several resampling methods have been proposed in the literature. The unequal label distribution in most multilabel datasets, with disparate imbalance levels, could be a handicap while learning new classifiers. In ...


Training algorithms for Radial Basis Function Networks to tackle learning processes with imbalanced data-sets

Nowadays, many real applications comprise data-sets where the distribution of the classes is significantly different. These data-sets are commonly known as imbalanced data-sets. Traditional classifiers are not able to deal with these kinds of data-sets because they tend to classify only majority classes, obtaining poor results for minority classes. The approaches that have been proposed to addr...


On Mining Fuzzy Classification Rules for Imbalanced Data

The fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification purposes. One of the major issues when applying it to imbalanced data sets is its bias toward the majority class, such that it performs poorly with respect to the minority class. However, in many cases the minority classes are more important than the majority ones. In this paper, we have extended ...



Publication date: 2001